1) Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn import metrics
import matplotlib.pyplot as plt

#importing seaborn for statistical plots
import seaborn as sns
sns.set(color_codes = True)

# To enable plotting graphs in Jupyter notebook
%matplotlib inline

import scipy.stats as stats
import statsmodels.api as statm

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

from scipy.stats import zscore

2) Reading the data as a data frame

In [2]:
vehicle = pd.read_csv("vehicle.csv")
vehicle.head()
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [3]:
vehicle.tail()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
841 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car
842 89 46.0 84.0 163.0 66.0 11 159.0 43.0 20.0 159 173.0 368.0 176.0 72.0 1.0 20.0 186.0 197 van
843 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car
844 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car
845 85 36.0 66.0 123.0 55.0 5 120.0 56.0 17.0 128 140.0 212.0 131.0 73.0 1.0 18.0 186.0 190 van

3) Data Analysis and Preparation

Label encoding the target class and checking the shape of the data

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() 
columns = vehicle.columns
#Let's Label Encode our class variable: 
print(columns)
vehicle['class'] = le.fit_transform(vehicle['class'])
vehicle.shape
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')
Out[4]:
(846, 19)
In [5]:
vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    int32  
dtypes: float64(14), int32(1), int64(4)
memory usage: 122.4 KB

Data type of each attribute

In [6]:
vehicle.dtypes
Out[6]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                            int32
dtype: object

Summary Statistics

In [7]:
vehicle.describe().T
Out[7]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
class 846.0 0.977541 0.702130 0.0 0.00 1.0 1.0 2.0

Observation:

  • compactness has nearly identical mean and median values, suggesting it is roughly normally distributed with little skewness and few outliers

  • circularity also appears roughly normally distributed, as its mean and median are close

  • scatter_ratio shows a noticeable gap between its mean (168.9) and median (157), suggesting skewness and possible outliers
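The mean-vs-median comparison above can be made programmatic. A minimal sketch on toy data (not the vehicle dataset): the relative gap between mean and median, scaled by the standard deviation, flags skewed columns automatically.

```python
import numpy as np
import pandas as pd

# Toy frame: 'sym' is roughly symmetric, 'skewed' has a long right tail
rng = np.random.default_rng(1)
df = pd.DataFrame({'sym': rng.normal(100, 10, size=500),
                   'skewed': rng.exponential(scale=50, size=500)})

# Relative gap between mean and median; a large value hints at skew/outliers
gap = (df.mean() - df.median()).abs() / df.std()
print(gap.round(3))
```

For a symmetric distribution the gap is near zero; for a right-skewed one the mean is pulled well above the median.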

Checking missing values (Data Cleaning)

In [8]:
vehicle.isna().apply(pd.Series.value_counts) # count of missing values per column
Out[8]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
False 846.0 841 842 840 844 846.0 845 845 843 846.0 843 844 844 842 840 845 845 846.0 846.0
True NaN 5 4 6 2 NaN 1 1 3 NaN 3 2 2 4 6 1 1 NaN NaN
  • As we can see, most of the columns have missing values.
In [9]:
from sklearn.impute import SimpleImputer
newdf = vehicle.copy()

X = newdf.iloc[:,0:19] #separting all numercial independent attribute

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
# fill missing values with the median of each column
transformed_values = imputer.fit_transform(X)
column = X.columns
print(column)
newdf = pd.DataFrame(transformed_values, columns = column )
newdf.describe().T
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')
Out[9]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.874704 33.401356 104.0 141.00 167.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.677305 7.882188 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.596927 31.360427 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.314421 176.496341 184.0 318.25 363.5 586.75 1018.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.443262 7.468734 59.0 67.00 71.5 75.00 135.0
skewness_about 846.0 6.361702 4.903244 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.600473 8.930962 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0
class 846.0 0.977541 0.702130 0.0 0.00 1.0 1.00 2.0
In [10]:
print("Original null value count:", vehicle.isnull().sum())
print("\n\nCount after we imputed the NaN values: ", newdf.isnull().sum())
Original null value count: compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64


Count after we imputed the NaN values:  compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64

Distribution of numeric variables

In [11]:
newdf.hist(bins=20, figsize=(60,40), color='lightblue', edgecolor = 'red')
plt.show()

Observations :

  • Most of the attributes appear roughly normally distributed

  • scaled_variance.1, skewness_about, skewness_about.1 and scatter_ratio appear right skewed

  • pr.axis_rectangularity seems to have outliers, as there are gaps in its histogram

Measuring skewness

In [12]:
plt.figure(figsize = (15,10))

plt.subplot(5,4,1)
# distplot has been removed from seaborn; histplot(kde=True) is the modern equivalent
sns.histplot(newdf['scaled_variance.1'], kde = True, color = "coral")

plt.subplot(5,4,2)
sns.histplot(newdf['scaled_variance'], kde = True, color = "coral")

plt.subplot(5,4,3)
sns.histplot(newdf['skewness_about.1'], kde = True, color = "coral")

plt.subplot(5,4,4)
sns.histplot(newdf['skewness_about'], kde = True, color = "coral")

plt.subplot(5,4,5)
sns.histplot(newdf['scatter_ratio'], kde = True, color = "coral")


plt.show()
In [13]:
newdf.skew(axis = 0, skipna = True)
Out[13]:
compactness                    0.381271
circularity                    0.264928
distance_circularity           0.108718
radius_ratio                   0.397572
pr.axis_aspect_ratio           3.835392
max.length_aspect_ratio        6.778394
scatter_ratio                  0.608710
elongatedness                  0.046951
pr.axis_rectangularity         0.774406
max.length_rectangularity      0.256359
scaled_variance                0.655598
scaled_variance.1              0.845345
scaled_radius_of_gyration      0.279910
scaled_radius_of_gyration.1    2.089979
skewness_about                 0.780813
skewness_about.1               0.689014
skewness_about.2               0.249985
hollows_ratio                 -0.226341
class                          0.031106
dtype: float64

Visualizing outliers using boxplots

In [14]:
plt.figure(figsize = (15,10))
ax = sns.boxplot(data=newdf, orient="h")
In [15]:
plt.figure(figsize= (15,10))
plt.subplot(5,3,1)
sns.boxplot(x= newdf['pr.axis_aspect_ratio'], color='cyan')

plt.subplot(5,3,2)
sns.boxplot(x= newdf.skewness_about, color='hotpink')

plt.subplot(5,3,3)
sns.boxplot(x= newdf.scaled_variance, color='yellow')

plt.subplot(5,3,4)
sns.boxplot(x= newdf['radius_ratio'], color='teal')

plt.subplot(5,3,5)
sns.boxplot(x= newdf['scaled_radius_of_gyration.1'], color='lightblue')

plt.subplot(5,3,6)
sns.boxplot(x= newdf['scaled_variance.1'], color='lavender')

plt.subplot(5,3,7)
sns.boxplot(x= newdf['max.length_aspect_ratio'], color='lightgrey')

plt.subplot(5,3,8)
sns.boxplot(x= newdf['skewness_about.1'], color='pink')

plt.show()

All of the above boxplots show outliers, visible as the individual points plotted beyond the whiskers.

Treating Outliers Using the IQR (1.5×IQR whiskers)

In [16]:
from scipy.stats import iqr

Q1 = newdf.quantile(0.25)
Q3 = newdf.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
compactness                     13.00
circularity                      9.00
distance_circularity            28.00
radius_ratio                    54.00
pr.axis_aspect_ratio             8.00
max.length_aspect_ratio          3.00
scatter_ratio                   51.00
elongatedness                   13.00
pr.axis_rectangularity           4.00
max.length_rectangularity       22.00
scaled_variance                 50.00
scaled_variance.1              268.50
scaled_radius_of_gyration       49.00
scaled_radius_of_gyration.1      8.00
skewness_about                   7.00
skewness_about.1                14.00
skewness_about.2                 9.00
hollows_ratio                   10.75
class                            1.00
dtype: float64
In [17]:
cleandf = newdf[~((newdf < (Q1 - 1.5 * IQR)) |(newdf > (Q3 + 1.5 * IQR))).any(axis=1)]
cleandf.shape
Out[17]:
(813, 19)

Let's plot the boxplots once again to see if the outliers are removed

In [18]:
plt.figure(figsize= (15,10))
plt.subplot(5,3,1)
sns.boxplot(x= cleandf['pr.axis_aspect_ratio'], color='cyan')

plt.subplot(5,3,2)
sns.boxplot(x= cleandf.skewness_about, color='hotpink')

plt.subplot(5,3,3)
sns.boxplot(x= cleandf.scaled_variance, color='yellow')

plt.subplot(5,3,4)
sns.boxplot(x= cleandf['radius_ratio'], color='teal')

plt.subplot(5,3,5)
sns.boxplot(x= cleandf['scaled_radius_of_gyration.1'], color='lightblue')

plt.subplot(5,3,6)
sns.boxplot(x= cleandf['scaled_variance.1'], color='lavender')

plt.subplot(5,3,7)
sns.boxplot(x= cleandf['max.length_aspect_ratio'], color='lightgrey')

plt.subplot(5,3,8)
sns.boxplot(x= cleandf['skewness_about.1'], color='pink')

plt.show()

The boxplots confirm that the outliers have been removed for all the attributes that had them. Since the number of outlier rows was small (846 − 813 = 33), we opted to remove them. In general this should be done cautiously, as it can cause information loss in large datasets with many outliers.
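An alternative to dropping rows, when information loss is a concern, is capping values at the Tukey whiskers instead. A minimal sketch on toy data (the helper name is illustrative, not part of this notebook):

```python
import pandas as pd

def clip_to_whiskers(df):
    """Cap each column at its Tukey whiskers instead of dropping rows."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds (a Series indexed by column name)
    return df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# Toy column with one extreme value: 100 is capped at the upper whisker
demo = pd.DataFrame({'x': [1, 2, 3, 4, 5, 100]})
print(clip_to_whiskers(demo))
```

This keeps every row, trading off a small distortion of the extreme values for no loss of observations.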

4) Understanding the relationships between the independent attributes:

In [19]:
#Let's drop the class column and look at the correlation matrix & pairplot before using this dataframe for PCA,
#as PCA should only be performed on the independent attributes

corr_df = newdf.drop('class', axis=1)

corr = corr_df.corr()

plt.figure(figsize=(16, 10))
sns.heatmap(corr, annot=True, cmap = "YlGnBu")
plt.show()

Observations:

Strong Correlation:

  • scaled_variance & scaled_variance.1 are strongly correlated (corr coeff: 0.98)

  • skewness_about.2 and hollows_ratio are strongly correlated (corr coeff: 0.89)

  • distance_circularity and radius_ratio have a high positive correlation (corr coeff: 0.81)

  • compactness & circularity, and radius_ratio & pr.axis_aspect_ratio, are moderately correlated (corr coeff: 0.67)

  • scaled_variance & scaled_radius_of_gyration, and circularity & distance_circularity, are highly correlated (corr coeff: 0.79)

  • pr.axis_rectangularity and max.length_rectangularity are strongly correlated (corr coeff: 0.81)

  • scatter_ratio and elongatedness have a strong negative correlation (corr coeff: -0.97)

  • elongatedness and pr.axis_rectangularity have a strong negative correlation (corr coeff: -0.95)

Weak/No Correlation:

  • max.length_aspect_ratio & radius_ratio have moderate correlation (corr coeff: ~0.5)

  • pr.axis_aspect_ratio & max.length_aspect_ratio show very little correlation

  • scaled_radius_of_gyration & scaled_radius_of_gyration.1 show very little correlation

  • scaled_radius_of_gyration.1 & skewness_about show very little correlation

  • skewness_about & skewness_about.1 appear uncorrelated

  • skewness_about.1 and skewness_about.2 appear uncorrelated

Pairplot

In [20]:
sns.pairplot(corr_df, diag_kind="kde")
Out[20]:
<seaborn.axisgrid.PairGrid at 0x1d8d202c248>

The pairplot confirms that scaled_variance & scaled_variance.1, and elongatedness & pr.axis_rectangularity, are strongly correlated. These pairs need to be treated carefully before model building.

Choosing the right attributes for model building

Our aim is to recognize whether an object is a van, bus, or car based on the input features. A key assumption is that there is little or no multicollinearity between the features: if two features are highly correlated, there is no point in using both.

From the correlation matrix above we can see that several feature pairs have an absolute correlation of 0.9 or more, and we can drop one feature from each such pair. There are 8 columns involved:

  • max.length_rectangularity
  • scaled_radius_of_gyration
  • skewness_about.2
  • scatter_ratio
  • elongatedness
  • pr.axis_rectangularity
  • scaled_variance
  • scaled_variance.1

We can pick one of the two highly correlated variables and drop the other. For example, scaled_variance & scaled_variance.1 have a strong positive correlation, so keeping both would only make the dimensions redundant.

Similarly, between elongatedness and pr.axis_rectangularity we can keep just one, as they have a very strong negative correlation.
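The pair-dropping idea above can be sketched programmatically. This is a minimal illustration on toy data; `drop_highly_correlated` is a hypothetical helper, not part of this notebook:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: b is a near-copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({'a': a,
                     'b': a * 2 + rng.normal(scale=0.01, size=100),
                     'c': rng.normal(size=100)})
reduced = drop_highly_correlated(demo, threshold=0.9)
print(list(reduced.columns))  # 'b' is dropped; 'a' and 'c' remain
```

Scanning the upper triangle only ensures one member of each correlated pair survives rather than both being removed.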

5) Distribution of target column

In [21]:
#display how many are car,bus,van. 
print(cleandf['class'].value_counts())

sns.countplot(x=cleandf['class'])
plt.show()
1.0    416
0.0    208
2.0    189
Name: class, dtype: int64

6) Principal Component Analysis (PCA):

In [22]:
# we separate the target variable (class) into y; X contains the independent variables.
X = cleandf.iloc[:,0:18].values
y = cleandf.iloc[:,18].values
In [23]:
# scaling the data using the standard scaler
from sklearn.preprocessing import StandardScaler
XScaled = StandardScaler().fit_transform(X)
In [24]:
# generating the covariance matrix and the eigen values for the PCA analysis
cov_matrix = np.cov(XScaled.T) # covariance matrix of the scaled features
print("Covariance Matrix shape:",cov_matrix.shape)
print("Covariance Matrix\n", cov_matrix)
Covariance Matrix shape: (18, 18)
Covariance Matrix
 [[ 1.00123153e+00  6.80164027e-01  7.87792814e-01  7.46906930e-01
   2.00881439e-01  4.98273207e-01  8.11840645e-01 -7.89531434e-01
   8.12866245e-01  6.74996601e-01  7.92438680e-01  8.13494150e-01
   5.78399755e-01 -2.53990635e-01  2.00887113e-01  1.61304844e-01
   2.95777412e-01  3.64608943e-01]
 [ 6.80164027e-01  1.00123153e+00  7.87747162e-01  6.41725205e-01
   2.06409699e-01  5.64854067e-01  8.44804611e-01 -8.16768295e-01
   8.41196310e-01  9.62404205e-01  8.03750964e-01  8.33508154e-01
   9.26281607e-01  6.67790806e-02  1.40563881e-01 -1.43598307e-02
  -1.16976151e-01  3.92302597e-02]
 [ 7.87792814e-01  7.87747162e-01  1.00123153e+00  8.09326627e-01
   2.45756551e-01  6.69657073e-01  9.06692225e-01 -9.09806087e-01
   8.95884623e-01  7.69635504e-01  8.85221631e-01  8.89286924e-01
   7.03348558e-01 -2.38231284e-01  9.89345733e-02  2.63832735e-01
   1.29070982e-01  3.22051625e-01]
 [ 7.46906930e-01  6.41725205e-01  8.09326627e-01  1.00123153e+00
   6.67029240e-01  4.61258592e-01  7.90495472e-01 -8.45064567e-01
   7.64769672e-01  5.77501217e-01  7.93778346e-01  7.77097647e-01
   5.51222677e-01 -4.03672885e-01  4.03555670e-02  1.87420711e-01
   4.18869167e-01  5.05314324e-01]
 [ 2.00881439e-01  2.06409699e-01  2.45756551e-01  6.67029240e-01
   1.00123153e+00  1.38431761e-01  2.00217560e-01 -3.02289321e-01
   1.69961019e-01  1.46036511e-01  2.15074904e-01  1.86526180e-01
   1.53697623e-01 -3.25502385e-01 -5.16026240e-02 -2.86185855e-02
   4.06792617e-01  4.20318003e-01]
 [ 4.98273207e-01  5.64854067e-01  6.69657073e-01  4.61258592e-01
   1.38431761e-01  1.00123153e+00  4.98078976e-01 -5.02996017e-01
   4.97845069e-01  6.48642021e-01  4.12068816e-01  4.58456162e-01
   4.04786322e-01 -3.33161873e-01  8.41082601e-02  1.41145578e-01
   5.64852182e-02  3.94934461e-01]
 [ 8.11840645e-01  8.44804611e-01  9.06692225e-01  7.90495472e-01
   2.00217560e-01  4.98078976e-01  1.00123153e+00 -9.73537513e-01
   9.90659730e-01  8.08063766e-01  9.78751548e-01  9.94204811e-01
   7.95893849e-01  2.44702588e-03  6.35490363e-02  2.14445853e-01
  -3.10409338e-03  1.16323654e-01]
 [-7.89531434e-01 -8.16768295e-01 -9.09806087e-01 -8.45064567e-01
  -3.02289321e-01 -5.02996017e-01 -9.73537513e-01  1.00123153e+00
  -9.51112661e-01 -7.70982661e-01 -9.66090990e-01 -9.56973892e-01
  -7.63345981e-01  8.70842667e-02 -4.55135596e-02 -1.84181395e-01
  -1.05393355e-01 -2.11345600e-01]
 [ 8.12866245e-01  8.41196310e-01  8.95884623e-01  7.64769672e-01
   1.69961019e-01  4.97845069e-01  9.90659730e-01 -9.51112661e-01
   1.00123153e+00  8.11346565e-01  9.64981168e-01  9.88989478e-01
   7.93172901e-01  1.77904437e-02  7.28156271e-02  2.16892797e-01
  -2.65026808e-02  9.80719286e-02]
 [ 6.74996601e-01  9.62404205e-01  7.69635504e-01  5.77501217e-01
   1.46036511e-01  6.48642021e-01  8.08063766e-01 -7.70982661e-01
   8.11346565e-01  1.00123153e+00  7.50600479e-01  7.95049173e-01
   8.68007898e-01  5.26495142e-02  1.34795631e-01 -2.44448372e-03
  -1.17812145e-01  6.72596198e-02]
 [ 7.92438680e-01  8.03750964e-01  8.85221631e-01  7.93778346e-01
   2.15074904e-01  4.12068816e-01  9.78751548e-01 -9.66090990e-01
   9.64981168e-01  7.50600479e-01  1.00123153e+00  9.76750881e-01
   7.81984129e-01  1.68621531e-02  3.39888849e-02  2.05971428e-01
   2.28035846e-02  9.60435931e-02]
 [ 8.13494150e-01  8.33508154e-01  8.89286924e-01  7.77097647e-01
   1.86526180e-01  4.58456162e-01  9.94204811e-01 -9.56973892e-01
   9.88989478e-01  7.95049173e-01  9.76750881e-01  1.00123153e+00
   7.90805725e-01  1.62348310e-02  6.49567636e-02  2.03838067e-01
   7.85566308e-05  1.03330899e-01]
 [ 5.78399755e-01  9.26281607e-01  7.03348558e-01  5.51222677e-01
   1.53697623e-01  4.04786322e-01  7.95893849e-01 -7.63345981e-01
   7.93172901e-01  8.68007898e-01  7.81984129e-01  7.90805725e-01
   1.00123153e+00  2.16651698e-01  1.68973862e-01 -5.83635746e-02
  -2.32617810e-01 -1.20727281e-01]
 [-2.53990635e-01  6.67790806e-02 -2.38231284e-01 -4.03672885e-01
  -3.25502385e-01 -3.33161873e-01  2.44702588e-03  8.70842667e-02
   1.77904437e-02  5.26495142e-02  1.68621531e-02  1.62348310e-02
   2.16651698e-01  1.00123153e+00 -5.93373719e-02 -1.31142620e-01
  -8.43627948e-01 -9.18420730e-01]
 [ 2.00887113e-01  1.40563881e-01  9.89345733e-02  4.03555670e-02
  -5.16026240e-02  8.41082601e-02  6.35490363e-02 -4.55135596e-02
   7.28156271e-02  1.34795631e-01  3.39888849e-02  6.49567636e-02
   1.68973862e-01 -5.93373719e-02  1.00123153e+00 -4.53538836e-02
   8.48972195e-02  6.12111362e-02]
 [ 1.61304844e-01 -1.43598307e-02  2.63832735e-01  1.87420711e-01
  -2.86185855e-02  1.41145578e-01  2.14445853e-01 -1.84181395e-01
   2.16892797e-01 -2.44448372e-03  2.05971428e-01  2.03838067e-01
  -5.83635746e-02 -1.31142620e-01 -4.53538836e-02  1.00123153e+00
   7.28908031e-02  2.00156475e-01]
 [ 2.95777412e-01 -1.16976151e-01  1.29070982e-01  4.18869167e-01
   4.06792617e-01  5.64852182e-02 -3.10409338e-03 -1.05393355e-01
  -2.65026808e-02 -1.17812145e-01  2.28035846e-02  7.85566308e-05
  -2.32617810e-01 -8.43627948e-01  8.48972195e-02  7.28908031e-02
   1.00123153e+00  8.91041674e-01]
 [ 3.64608943e-01  3.92302597e-02  3.22051625e-01  5.05314324e-01
   4.20318003e-01  3.94934461e-01  1.16323654e-01 -2.11345600e-01
   9.80719286e-02  6.72596198e-02  9.60435931e-02  1.03330899e-01
  -1.20727281e-01 -9.18420730e-01  6.12111362e-02  2.00156475e-01
   8.91041674e-01  1.00123153e+00]]
In [25]:
#generating the eigen values and the eigen vectors
e_vals, e_vecs = np.linalg.eig(cov_matrix)
print('Eigenvectors \n%s' %e_vecs)
print('\nEigenvalues \n%s' %e_vals)
Eigenvectors 
[[-2.72251046e-01 -8.97284818e-02  2.26045073e-02  1.30419032e-01
  -1.52324139e-01  2.58374578e-01 -1.88794221e-01 -7.71578238e-01
  -3.61784776e-01 -1.25233628e-01  2.92009470e-02  7.62442008e-04
  -1.06680587e-02  1.05983722e-02 -1.01407495e-01 -1.46326861e-01
  -3.81638532e-03  3.32992130e-03]
 [-2.85370045e-01  1.33173937e-01  2.10809943e-01 -2.06785531e-02
   1.39022591e-01 -6.88979940e-02  3.90871235e-01 -6.60528436e-02
  -4.62957583e-02  2.40262612e-01  7.29503235e-02  1.93799916e-01
  -7.74670931e-03 -8.71766559e-02 -3.11337823e-01  1.96463651e-01
  -2.96230720e-01  5.83996136e-01]
 [-3.01486231e-01 -4.40259591e-02 -7.08780817e-02  1.07425217e-01
   8.07335409e-02 -2.04800896e-02 -1.76384547e-01  2.98693883e-01
  -2.64499195e-01 -9.42971834e-02  7.78755026e-01 -2.32649049e-01
   1.11905744e-02  2.28724292e-02  5.89166755e-02  5.33931974e-02
   9.72735293e-02  8.64160083e-02]
 [-2.72594510e-01 -2.04232234e-01 -4.02139629e-02 -2.52957341e-01
  -1.19012554e-01 -1.39449676e-01 -1.56474448e-01  5.20410402e-02
  -1.70430331e-01  8.97062530e-02 -1.31647081e-01  2.75143903e-01
  -3.74689248e-02  2.90668794e-02 -2.04574984e-01  6.58916577e-01
   2.74900989e-01 -2.71300494e-01]
 [-9.85797647e-02 -2.59136858e-01  1.14805227e-01 -6.05228001e-01
  -8.32128223e-02 -5.87145492e-01 -1.02492950e-01 -1.61872497e-01
   1.17212341e-02  2.87528583e-02  4.97534613e-02 -1.45558629e-01
   2.09842091e-02 -9.40948646e-03  1.50893891e-01 -2.89610835e-01
  -1.19100067e-01  9.64017331e-02]
 [-1.94755787e-01 -9.45756320e-02  1.39313484e-01  3.22531411e-01
   6.21376071e-01 -2.65624695e-01 -3.98851794e-01 -5.85800952e-02
   1.73213170e-01 -2.49937617e-01 -1.98444456e-01  1.72600201e-01
  -1.06888298e-02  1.20980507e-02  1.76055013e-01  6.68511988e-02
  -2.92959443e-02  1.10841470e-01]
 [-3.10518442e-01  7.23350799e-02 -1.12924698e-01 -1.00540370e-02
  -8.12405608e-02  8.93335163e-02 -9.14237336e-02  8.45300921e-02
   1.37499298e-01  1.11244025e-01 -1.61642905e-01 -8.22439493e-02
   8.37148260e-01  2.72442207e-01 -1.51805844e-02 -7.66778803e-02
   5.60355480e-02  8.33248999e-02]
 [ 3.08438338e-01 -1.16876769e-02  9.00330455e-02  7.99117560e-02
   7.47379231e-02 -7.25853857e-02  1.04875746e-01 -2.16815347e-01
  -2.59988735e-01  1.24837047e-01 -4.29365477e-03 -3.50089602e-01
   2.42295907e-01  2.61394487e-03  4.61164909e-01  5.23226723e-01
  -2.65096114e-01 -1.36447171e-02]
 [-3.07548493e-01  8.40915278e-02 -1.11063547e-01  1.60464922e-02
  -7.75020996e-02  9.60554272e-02 -9.06723384e-02  3.37069994e-02
   1.03269951e-01  2.11468012e-01 -2.40841717e-01 -3.42527317e-01
  -9.86931593e-02 -6.84892390e-01  2.18872117e-01  2.39504315e-02
   2.70709305e-01  1.72817545e-01]
 [-2.76301073e-01  1.25836631e-01  2.19877688e-01  6.66507863e-02
   2.46140560e-01 -6.35014904e-02  3.49667685e-01 -2.26684736e-01
   2.44776407e-01  3.87473859e-01  2.24580349e-01  3.05154380e-02
  -1.40549391e-02  4.47385929e-02  1.53765067e-01 -1.04419937e-01
   1.53673085e-01 -5.43122947e-01]
 [-3.02748114e-01  7.01998575e-02 -1.44818765e-01 -6.98045095e-02
  -1.49584067e-01  1.34458896e-01 -7.54753072e-02  1.45772665e-01
   5.85239946e-02 -1.47036092e-01  2.06902072e-02  2.33368955e-01
   1.43866319e-02 -2.54510995e-01  1.79499013e-01  1.16604375e-02
  -7.26163025e-01 -3.24937516e-01]
 [-3.07040626e-01  7.79336637e-02 -1.15323952e-01 -1.73631584e-02
  -1.15117310e-01  1.26968672e-01 -6.99641470e-02  5.32611781e-02
   1.28904560e-01  1.60305310e-01 -1.96322990e-01 -2.75169550e-01
  -4.75672122e-01  6.13103868e-01  2.20362642e-01  7.99305617e-02
  -1.22815848e-01  1.42051799e-01]
 [-2.61520489e-01  2.09927277e-01  2.13627435e-01 -7.22457181e-02
   7.54871674e-03 -7.33961842e-02  4.55851958e-01  1.58194670e-01
  -3.37170589e-01 -5.87690102e-01 -2.58436921e-01 -1.07063554e-01
   8.61256926e-03  4.41891377e-02  1.43753708e-01 -5.21969873e-02
   1.69567965e-01 -8.32177228e-02]
 [ 4.36323635e-02  5.03914450e-01 -6.73920886e-02 -1.35860558e-01
  -1.40527774e-01 -1.31928871e-01 -7.90311042e-02 -3.00374428e-01
   5.01365221e-01 -3.87030017e-01  2.27875444e-01 -1.38958435e-01
   7.55464886e-03 -1.59765660e-02 -1.34656976e-01  3.04769192e-01
   5.39469506e-02  3.01217731e-02]
 [-3.67057041e-02 -1.45682524e-02  5.21623444e-01  4.90121679e-01
  -5.89800103e-01 -3.12415086e-01 -1.30187397e-01  1.14687509e-01
   7.50393829e-02  5.41502565e-02 -1.39861362e-02  5.61401152e-03
  -2.19811008e-03 -5.03222786e-03 -1.37166771e-02 -4.76724453e-03
  -3.27151282e-02 -2.14301813e-02]
 [-5.88504115e-02 -9.33980545e-02 -6.87170643e-01  3.80232477e-01
  -1.27793729e-01 -4.82506903e-01  3.10629290e-01 -1.18168951e-01
  -3.07213623e-02 -1.36044539e-02 -1.77010708e-02  8.59021362e-02
  -1.39575997e-02  1.10992435e-02  2.72433694e-02 -2.97178011e-02
   1.82173722e-02  1.83842486e-02]
 [-3.48373860e-02 -5.01664210e-01  6.22069465e-02 -3.55391597e-02
  -1.81582693e-01  2.75222340e-01  2.59557864e-01 -7.27008273e-02
   3.62122453e-01 -2.20343289e-01  1.73696003e-01  2.79657886e-01
   3.82401827e-02  7.76499049e-03  4.14581122e-01  1.14797284e-01
   1.66961820e-01  2.41026732e-01]
 [-8.28136172e-02 -5.06546563e-01  4.08035393e-02  1.03008417e-01
   1.11256244e-01  6.05771535e-02  1.76348774e-01  1.81034286e-02
   2.40710780e-01 -1.71416688e-01 -7.22825606e-02 -5.36171185e-01
   3.98716359e-03 -4.78049584e-02 -4.65683959e-01  8.53480643e-02
  -1.96223612e-01 -1.78387852e-01]]

Eigenvalues 
[9.79297570e+00 3.37710644e+00 1.20873054e+00 1.13659560e+00
 8.96286859e-01 6.58293128e-01 3.23056525e-01 2.26906613e-01
 1.12741686e-01 7.62069059e-02 6.18393099e-02 4.42420969e-02
 3.12610726e-03 1.01216098e-02 2.99919142e-02 2.67735138e-02
 1.77191935e-02 1.94537446e-02]
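The eigenvalues above determine how much variance each principal component explains. A minimal sketch of turning eigenvalues into cumulative explained-variance ratios, which is the usual basis for choosing the number of components (a toy matrix stands in for XScaled here):

```python
import numpy as np

# Toy standardized data standing in for XScaled (100 rows, 5 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

cov = np.cov(X.T)
# eigh is the appropriate routine for a symmetric covariance matrix
# (guaranteed real eigenvalues, returned in ascending order)
e_vals = np.linalg.eigh(cov)[0]

# Fraction of total variance carried by each component, largest first
var_ratio = e_vals[::-1] / e_vals.sum()
cum_var = np.cumsum(var_ratio)
print(np.round(cum_var, 3))
```

The cumulative curve always ends at 1.0; components are typically kept up to the point where it crosses a chosen threshold such as 0.95.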
In [26]:
eigen_pairs = [(np.abs(e_vals[index]), e_vecs[:,index]) for index in range(len(e_vals))]

# Sort the (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue.
# Sort on the eigenvalue only: comparing the eigenvector arrays as tie-breakers would raise an error.
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
eigen_pairs[:18]
Out[26]:
[(9.792975698382953,
  array([-0.27225105, -0.28537005, -0.30148623, -0.27259451, -0.09857976,
         -0.19475579, -0.31051844,  0.30843834, -0.30754849, -0.27630107,
         -0.30274811, -0.30704063, -0.26152049,  0.04363236, -0.0367057 ,
         -0.05885041, -0.03483739, -0.08281362])),
 (3.377106439893973,
  array([-0.08972848,  0.13317394, -0.04402596, -0.20423223, -0.25913686,
         -0.09457563,  0.07233508, -0.01168768,  0.08409153,  0.12583663,
          0.07019986,  0.07793366,  0.20992728,  0.50391445, -0.01456825,
         -0.09339805, -0.50166421, -0.50654656])),
 (1.2087305396351002,
  array([ 0.02260451,  0.21080994, -0.07087808, -0.04021396,  0.11480523,
          0.13931348, -0.1129247 ,  0.09003305, -0.11106355,  0.21987769,
         -0.14481876, -0.11532395,  0.21362744, -0.06739209,  0.52162344,
         -0.68717064,  0.06220695,  0.04080354])),
 (1.136595602176694,
  array([ 0.13041903, -0.02067855,  0.10742522, -0.25295734, -0.605228  ,
          0.32253141, -0.01005404,  0.07991176,  0.01604649,  0.06665079,
         -0.06980451, -0.01736316, -0.07224572, -0.13586056,  0.49012168,
          0.38023248, -0.03553916,  0.10300842])),
 (0.8962868592787947,
  array([-0.15232414,  0.13902259,  0.08073354, -0.11901255, -0.08321282,
          0.62137607, -0.08124056,  0.07473792, -0.0775021 ,  0.24614056,
         -0.14958407, -0.11511731,  0.00754872, -0.14052777, -0.5898001 ,
         -0.12779373, -0.18158269,  0.11125624])),
 (0.6582931281646512,
  array([ 0.25837458, -0.06889799, -0.02048009, -0.13944968, -0.58714549,
         -0.26562469,  0.08933352, -0.07258539,  0.09605543, -0.06350149,
          0.1344589 ,  0.12696867, -0.07339618, -0.13192887, -0.31241509,
         -0.4825069 ,  0.27522234,  0.06057715])),
 (0.3230565251079227,
  array([-0.18879422,  0.39087124, -0.17638455, -0.15647445, -0.10249295,
         -0.39885179, -0.09142373,  0.10487575, -0.09067234,  0.34966769,
         -0.07547531, -0.06996415,  0.45585196, -0.0790311 , -0.1301874 ,
          0.31062929,  0.25955786,  0.17634877])),
 (0.2269066128235807,
  array([-0.77157824, -0.06605284,  0.29869388,  0.05204104, -0.1618725 ,
         -0.0585801 ,  0.08453009, -0.21681535,  0.033707  , -0.22668474,
          0.14577266,  0.05326118,  0.15819467, -0.30037443,  0.11468751,
         -0.11816895, -0.07270083,  0.01810343])),
 (0.11274168632338624,
  array([-0.36178478, -0.04629576, -0.2644992 , -0.17043033,  0.01172123,
          0.17321317,  0.1374993 , -0.25998873,  0.10326995,  0.24477641,
          0.05852399,  0.12890456, -0.33717059,  0.50136522,  0.07503938,
         -0.03072136,  0.36212245,  0.24071078])),
 (0.0762069059326686,
  array([-0.12523363,  0.24026261, -0.09429718,  0.08970625,  0.02875286,
         -0.24993762,  0.11124403,  0.12483705,  0.21146801,  0.38747386,
         -0.14703609,  0.16030531, -0.5876901 , -0.38703002,  0.05415026,
         -0.01360445, -0.22034329, -0.17141669])),
 (0.061839309866481375,
  array([ 0.02920095,  0.07295032,  0.77875503, -0.13164708,  0.04975346,
         -0.19844446, -0.16164291, -0.00429365, -0.24084172,  0.22458035,
          0.02069021, -0.19632299, -0.25843692,  0.22787544, -0.01398614,
         -0.01770107,  0.173696  , -0.07228256])),
 (0.04424209694975977,
  array([ 0.00076244,  0.19379992, -0.23264905,  0.2751439 , -0.14555863,
          0.1726002 , -0.08224395, -0.3500896 , -0.34252732,  0.03051544,
          0.23336895, -0.27516955, -0.10706355, -0.13895843,  0.00561401,
          0.08590214,  0.27965789, -0.53617119])),
 (0.02999191420611325,
  array([-0.10140749, -0.31133782,  0.05891668, -0.20457498,  0.15089389,
          0.17605501, -0.01518058,  0.46116491,  0.21887212,  0.15376507,
          0.17949901,  0.22036264,  0.14375371, -0.13465698, -0.01371668,
          0.02724337,  0.41458112, -0.46568396])),
 (0.026773513807314915,
  array([-0.14632686,  0.19646365,  0.0533932 ,  0.65891658, -0.28961083,
          0.0668512 , -0.07667788,  0.52322672,  0.02395043, -0.10441994,
          0.01166044,  0.07993056, -0.05219699,  0.30476919, -0.00476724,
         -0.0297178 ,  0.11479728,  0.08534806])),
 (0.01945374459814113,
  array([ 0.00332992,  0.58399614,  0.08641601, -0.27130049,  0.09640173,
          0.11084147,  0.0833249 , -0.01364472,  0.17281754, -0.54312295,
         -0.32493752,  0.1420518 , -0.08321772,  0.03012177, -0.02143018,
          0.01838425,  0.24102673, -0.17838785])),
 (0.017719193496813643,
  array([-0.00381639, -0.29623072,  0.09727353,  0.27490099, -0.11910007,
         -0.02929594,  0.05603555, -0.26509611,  0.2707093 ,  0.15367309,
         -0.72616302, -0.12281585,  0.16956796,  0.05394695, -0.03271513,
          0.01821737,  0.16696182, -0.19622361])),
 (0.010121609778869617,
  array([ 0.01059837, -0.08717666,  0.02287243,  0.02906688, -0.00940949,
          0.01209805,  0.27244221,  0.00261394, -0.68489239,  0.04473859,
         -0.25451099,  0.61310387,  0.04418914, -0.01597657, -0.00503223,
          0.01109924,  0.00776499, -0.04780496])),
 (0.0031261072615284837,
  array([-0.01066806, -0.00774671,  0.01119057, -0.03746892,  0.02098421,
         -0.01068883,  0.83714826,  0.24229591, -0.09869316, -0.01405494,
          0.01438663, -0.47567212,  0.00861257,  0.00755465, -0.00219811,
         -0.0139576 ,  0.03824018,  0.00398716]))]
In [27]:
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eigen_pairs[index][0] for index in range(len(e_vals))]
eigvectors_sorted = [eigen_pairs[index][1] for index in range(len(e_vals))]

# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
Eigenvalues in descending order: 
[9.792975698382953, 3.377106439893973, 1.2087305396351002, 1.136595602176694, 0.8962868592787947, 0.6582931281646512, 0.3230565251079227, 0.2269066128235807, 0.11274168632338624, 0.0762069059326686, 0.061839309866481375, 0.04424209694975977, 0.02999191420611325, 0.026773513807314915, 0.01945374459814113, 0.017719193496813643, 0.010121609778869617, 0.0031261072615284837]
In [28]:
# the "cumulative variance explained" analysis 
tot = sum(e_vals)
var_exp = [(i / tot) * 100 for i in sorted(e_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 54.33850121  73.07712653  79.78403645  86.09068965  91.0639364
  94.71662207  96.50917296  97.76821471  98.39378701  98.81663795
  99.1597671   99.40525421  99.571671    99.72022979  99.82817322
  99.9264921   99.9826541  100.        ]
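The cumulative percentages above can be turned into a component count programmatically: pick the smallest k whose cumulative explained variance reaches the target. A minimal sketch, using hypothetical eigenvalues of the same shape as ours (not the notebook's `e_vals`):

```python
import numpy as np

# Hypothetical eigenvalues (illustrative, not the notebook's e_vals)
e_vals = np.array([9.79, 3.38, 1.21, 1.14, 0.90, 0.66, 0.32, 0.23,
                   0.11, 0.08, 0.06, 0.04, 0.03, 0.03, 0.02, 0.02, 0.01, 0.003])

# Percent of variance explained per component, sorted descending, then accumulated
var_exp = np.sort(e_vals)[::-1] / e_vals.sum() * 100
cum_var_exp = np.cumsum(var_exp)

# Smallest k whose cumulative explained variance reaches the target
target = 95.0
k = int(np.searchsorted(cum_var_exp, target)) + 1
print(k, round(cum_var_exp[k - 1], 2))
```

`np.searchsorted` works here because the cumulative series is monotonically increasing.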
In [29]:
# Plotting the variance explained by each principal component and the cumulative variance explained.
plt.figure(figsize=(10 , 5))
plt.bar(range(1, e_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, e_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

From the plot above (and the cumulative output), the first 8 principal components explain about 98% of the variance in the data, with 7 components already crossing the 95% mark. So we will use the first 8 principal components going forward and calculate the reduced dimensions.

Dimensionality Reduction

Eight dimensions seem very reasonable: with 8 components we can explain over 95% of the variation in the original data.
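The manual eigen-decomposition can be cross-checked against scikit-learn's `PCA`, which accepts a float `n_components` as a variance target and picks the component count for you. A minimal sketch on synthetic standardized data (the array here is a stand-in, not the notebook's `XScaled`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 18))                      # stand-in for the scaled features
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

On the real `XScaled` this should select a component count consistent with the cumulative-variance table above.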

In [30]:
#dim_reduce represents the reduced mathematical space.

dim_reduce = np.array(eigvectors_sorted[0:8])   #Reducing from 18 to 8 dimensions

XScaled_pca = np.dot(XScaled,dim_reduce.T)   #projecting original data into principal component dimensions

dim_reduce = pd.DataFrame(XScaled_pca)  #converting array to dataframe for pairplot

dim_reduce
Out[30]:
0 1 2 3 4 5 6 7
0 -0.591125 -0.655523 0.564477 -0.659870 0.855251 -1.835814 0.155983 -0.683144
1 1.524878 -0.327117 0.251528 1.296236 0.282463 -0.091649 -0.209862 0.127745
2 -3.969982 0.239514 1.229875 0.180391 -0.919360 -0.650638 -0.826445 0.163185
3 1.549729 -3.037566 0.466449 0.394413 0.623392 0.383794 -0.131539 -0.176248
4 -5.468963 4.651385 -1.290061 0.023804 -1.692033 2.510965 -0.315330 0.475009
... ... ... ... ... ... ... ... ...
808 0.368201 -0.641878 -1.481101 0.164090 -0.777381 -0.934650 -0.874360 0.193428
809 0.040917 -0.160848 -0.473839 -0.179208 1.978454 -1.431609 0.279248 -0.302916
810 -5.188919 -0.171319 0.585738 -0.886837 1.348744 0.225891 -0.888525 -0.429704
811 3.321748 -1.094132 -1.930953 0.339361 0.527587 -0.030116 0.265542 0.451123
812 5.012853 0.432697 -1.315713 0.196398 0.167606 0.345863 0.409124 -0.221262

813 rows × 8 columns

In [31]:
sns.pairplot(dim_reduce, diag_kind='kde') 
Out[31]:
<seaborn.axisgrid.PairGrid at 0x1d8dee76408>

The pairplot above clearly shows that after dimensionality reduction with PCA the attributes have become uncorrelated: most panels show a cloud of data points with no linear relationship.
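The visual impression from the pairplot can be checked numerically: the correlation matrix of the PCA scores should be (near) identity. A small sketch on synthetic correlated data, following the same covariance eigen-decomposition recipe (names here are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic correlated data: two latent factors mixed into five features
latent = rng.normal(size=(500, 2))
mix = rng.normal(size=(2, 5))
X_demo = latent @ mix + 0.1 * rng.normal(size=(500, 5))

# Project onto eigenvectors of the covariance matrix
Xc = X_demo - X_demo.mean(axis=0)
e_vals, e_vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ e_vecs

# Off-diagonal correlations of the scores should be ~0
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())
```

This holds by construction: the sample covariance of the scores is `e_vecs.T @ C @ e_vecs`, which is diagonal for the eigenvectors of `C`.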

7) Splitting the data

We will use 70% of data for training and 30% for testing.

In [32]:
from sklearn.model_selection import train_test_split

#original data
X_train, X_test, y_train, y_test = train_test_split(XScaled, y, test_size = 0.30, random_state = 1)

#PCA Data
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(dim_reduce, y, test_size=0.30, random_state = 1)
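The splits above are purely random. With uneven class counts (as in this dataset, where 'car' dominates), it can help to pass `stratify=y` so each split preserves the class proportions. A sketch on hypothetical imbalanced labels (not the notebook's `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60/30/10 split across three classes
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo)

# Class proportions survive the split in both halves
print(np.bincount(ytr), np.bincount(yte))
```

Without `stratify`, a rare class can end up under-represented in the test set by chance.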

Let's train the model on the original data and on the PCA-reduced data.

In [33]:
from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train, y_train)

#predict the y value
y_pred = clf.predict(X_test)

print ('Before PCA score', clf.score(X_test, y_test))
Before PCA score 0.9795081967213115
In [34]:
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# Classification Report
print(classification_report(y_test, y_pred)) 

# Creates a confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred) 

# Transform to df for easier plotting
df_cm = pd.DataFrame(cm,
                     index = ['van','car','bus'], 
                     columns = ['van','car','bus'])

plt.figure(figsize=(9,6))
sns.heatmap(df_cm, annot=True, cmap='YlGnBu', fmt='g')
plt.title('Accuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99        78
         1.0       0.98      0.98      0.98       121
         2.0       0.98      0.93      0.95        45

    accuracy                           0.98       244
   macro avg       0.98      0.97      0.98       244
weighted avg       0.98      0.98      0.98       244

From the report above, the SVC model on the original data classifies 'van' perfectly (recall 1.00), while recall for 'car' (0.98) and 'bus' (0.93) is only slightly lower.
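Per-class recall and precision can be read off a confusion matrix directly: recall is the diagonal divided by the row sums (true counts), precision the diagonal divided by the column sums (predicted counts). A sketch with a hypothetical 3×3 matrix consistent with the supports above (not the notebook's actual `cm`):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[78,   0,  0],
               [ 1, 118,  2],
               [ 0,   3, 42]])

recall = cm.diagonal() / cm.sum(axis=1)      # TP / (TP + FN) per class
precision = cm.diagonal() / cm.sum(axis=0)   # TP / (TP + FP) per class
print(recall.round(2), precision.round(2))
```

This is exactly what `classification_report` computes, so the two views should always agree.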

After PCA score

In [35]:
clf.fit(pca_X_train, pca_y_train)

#predict the y value
pca_y_pred = clf.predict(pca_X_test)

print ('After PCA score', clf.score(pca_X_test, pca_y_test))
After PCA score 0.9631147540983607
In [36]:
# Classification Report
print(classification_report(pca_y_test, pca_y_pred)) 

# Creates a confusion matrix
cm = metrics.confusion_matrix(pca_y_test, pca_y_pred) 

# Transform to df for easier plotting
df_cm = pd.DataFrame(cm,
                     index = ['van','car','bus'], 
                     columns = ['van','car','bus'])

plt.figure(figsize=(9,6))
sns.heatmap(df_cm, annot=True, cmap='YlGnBu', fmt='g')
plt.title('Accuracy:{0:.3f}'.format(accuracy_score(pca_y_test, pca_y_pred)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97        78
         1.0       0.97      0.96      0.96       121
         2.0       0.98      0.91      0.94        45

    accuracy                           0.96       244
   macro avg       0.96      0.96      0.96       244
weighted avg       0.96      0.96      0.96       244

After PCA the model still classifies 'van' perfectly (recall 1.00), while recall for 'car' (116/121 ≈ 0.96) and 'bus' (41/45 ≈ 0.91) drops slightly compared with the original-data model.

8) Comparing the scores

On the given dataset we trained models with both the original and the dimensionally reduced data.

- For the SVM model, we got 98% accuracy with the original data.

- With the PCA-reduced data, we got 96% accuracy, so PCA cost only about two points of accuracy while more than halving the number of features.

Dimensionality reduction plays a really important role in machine learning, especially when working with a large number of features. Principal Component Analysis is one of the most widely used dimensionality reduction algorithms, and it is easy to understand.

The original dataset was composed of 18 features × 846 rows. After applying Principal Component Analysis, only 8 principal components were enough to keep a 96% accuracy score!
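The whole workflow (scale → PCA → SVC) can also be expressed as a single scikit-learn `Pipeline`, which keeps the scaler and PCA fitted on the training fold only. A sketch using sklearn's bundled wine data as a stand-in for `vehicle.csv`:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Scaling, projection, and classification chained into one estimator
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),   # keep components for 95% of the variance
    ('svc', SVC()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

A pipeline like this also makes cross-validation and grid search over the component count straightforward.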